Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.6 - Filtering Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
  | id | breed_id | nickname | birthday | age | color
---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None
Filtering Data
Filtering out unwanted rows is another commonly used operation in data analysis.
Option 1 - where()
(
    pets
    .where(F.col('breed_id') == 1)
    .filter(F.col('color') == 'brown')
    .toPandas()
)
  | id | breed_id | nickname | birthday | age | color
---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown
What Happened?
Similar to the functions we have seen so far, there are multiple functions that are aliased to different names and perform the same transformation; `where()` and `filter()` are interchangeable. Personally I prefer `where()` as it's a bit more intuitive and closer to the SQL syntax.
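For instance, here is a small sketch (not part of the original example) that swaps the two calls around; because `filter()` is just an alias of `where()`, it returns the exact same row as above:

(
    pets
    .filter(F.col('breed_id') == 1)   # same as .where(...)
    .where(F.col('color') == 'brown') # same as .filter(...)
    .toPandas()
)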
Note:
Notice how we don't have to wrap 1 or 'brown' in an F.lit() function; when a column is compared to a plain Python value, Spark treats the value as a literal column expression automatically.
We will look into how to perform more complex conditions, involving more than one condition, in 2.1.7.
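As a rough sketch of what the note means (not part of the original example), here is the same filter with the literal wrapped explicitly in F.lit(), plus a small preview of a compound condition, which is covered properly later:

# Explicit F.lit() - equivalent to comparing against the plain value 1.
pets.where(F.col('breed_id') == F.lit(1)).toPandas()

# Preview: two conditions combined with `&`; each condition needs its own parentheses.
pets.where((F.col('breed_id') == 1) & (F.col('color') == 'brown')).toPandas()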
Option 2 - isin()
(
    pets
    .where(F.col('nickname').isin('King', 'Argus'))
    .toPandas()
)
  | id | breed_id | nickname | birthday | age | color
---|---|---|---|---|---|---
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None
What Happened?
If you want to check whether a column's value matches any of several values, you can use the isin() function. This function accepts either a list of values or the values passed as comma-separated arguments. This is again very similar to SQL syntax.
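As a quick sketch (not from the original example), both call styles return the same rows:

# Passing a Python list.
pets.where(F.col('nickname').isin(['King', 'Argus'])).toPandas()

# Passing the values as separate, comma-separated arguments.
pets.where(F.col('nickname').isin('King', 'Argus')).toPandas()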
Summary
- We learnt of two filter functions in Spark: `where()` and `isin()`.
- Using `isin()` you can check whether a column's value is one of several values.
- These functions are named similarly to their SQL counterparts.